Explorative Data Analysis (EDA)

Here we perform some basic EDA operations.

In [1]:
import os
import pandas as pd
import geopandas as gpd
import numpy as np
import feather
from pandas_profiling import ProfileReport
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sn
import psutil
from translate import Translator
In [2]:
os.chdir('..')
os.chdir('data')
os.getcwd()
Out[2]:
'C:\\Users\\tpytsui\\Documents\\Surfdrive\\Documents\\_PhD\\_github\\lma_circular-maker-city\\data'
In [3]:
df = feather.read_dataframe('lma/af_2019-2020_matched-codes.feather')

Finding main flow types

For our research, we are looking at the location of 'eerste afnemers', the receivers of secondary resources. However, there is a problem - some locations of 'eerste afnemers' actually represent the location of the offices of the reusing company, rather than the actual location of reuse. For example, for the process of using rubble as a foundation for roads, the eerste afnemers' locations are the locations of the contractors' offices, not the location of the roads that are being filled.

Because of this, we need to talk to experts from LMA to determine which flows are less likely to have accurate 'eerste afnemers' locations (such as rubble), and which could be more accurate (metal, plastic). The purpose of this section is therefore to prepare a list of major flow types from the LMA dataset, which we can use to discuss with LMA experts. The types will be categorized by VMC (processing type), SBI (industry type) and EWC/GNC (material type).

In [4]:
# clean df 
df = feather.read_dataframe('lma/af_2019-2020_matched-codes.feather')
df = df[['eaNaam', 'gnc', 'ewc', 'vmc', 'sbi', 'kg']]
df.sbi = df.sbi.replace('nan', '0')
df = df.replace([None, 'nan', '0'], '--')

# change codes to chapters
df.gnc = df.gnc.str[:2]
df.ewc = df.ewc.str[:2]
df.vmc = df.vmc.str[:1]

# sbi chapters 
df.sbi = df.sbi.str.split(',')
def chap(sbis): 
    newSbis = []
    for sbi in sbis: 
        newSbis.append(sbi[:2])
    return list(set(newSbis))
df.sbi = df.sbi.map(lambda x: chap(x))
df['sbiLen'] = df.sbi.map(lambda x: len(x))
df.sbi = df.sbi.map(lambda x: ','.join(x))

Grouping by GNC, EWC, VMC, and SBI chapters

There are lots of different types of SBI chapters because many companies have multiple sbi codes, creating many different combinations

In [5]:
# flow types
ft = df.groupby(['gnc', 'ewc', 'vmc', 'sbi']).sum().reset_index().sort_values('kg', ascending=False)

# # add chapter text 
# ewcC = feather.read_dataframe('classification/ewc.feather')
# ft = pd.merge(ft, ewcC, how='left', on='ewc')

# vmcC = feather.read_dataframe('classification/vmc.feather')
# ft = pd.merge(ft, vmcC, how='left', on='vmc')

# gncC = feather.read_dataframe('classification/gnc.feather')
# gncC = gncC[['Code', 'descDetail']]
# gncC.rename(columns={'Code': 'gnc', 'descDetail': 'gncDesc'}, inplace=True)
# ft = pd.merge(ft, gncC, how='left', on='gnc')

ft['type'] = 'gnc:' + ft.gnc + ' ewc:' + ft.ewc + ' vmc:' + ft.vmc + ' sbi:' + ft.sbi

ft.head()
Out[5]:
gnc ewc vmc sbi kg sbiLen type
1317 25 -- B -- 5179703589 5002 gnc:25 ewc:-- vmc:B sbi:--
710 -- 19 B -- 3085370931 5024 gnc:-- ewc:19 vmc:B sbi:--
405 -- 17 B -- 2985070432 3849 gnc:-- ewc:17 vmc:B sbi:--
1386 25 -- B 43 1432089123 1173 gnc:25 ewc:-- vmc:B sbi:43
1380 25 -- B 42 1295399925 965 gnc:25 ewc:-- vmc:B sbi:42
In [6]:
fig, ax = plt.subplots(1,2,figsize=(9*2,5))

ax[0].pie(ft.kg.head(20))
ax[0].legend(ft.type.head(20), loc='center left', bbox_to_anchor=(1,0.5))
ax[0].set_title('top 20 flow types from afgifte dataset')

ax[1].pie(ft.kg)
ax[1].set_title('all flow types ({} in total)'.format(len(ft.kg)))

plt.show()

Here, we tried to categorize the afgifte flows using the gnc, ewc, vmc, and sbi. As seen above, there are 1864 flow types, which is a lot. The reason why there are so many flow types is because of the SBI codes - many companies have multiple sbi codes, as many as thirty!

Quick look at SBI codes

In [7]:
print('how many sbi codes do companies have in the dataset?')
for i in sorted(df.sbiLen.unique()): 
    numCom = len(df[df.sbiLen == i])
    print('{} companies have {} sbi codes'.format(numCom, i))
how many sbi codes do companies have in the dataset?
52194 companies have 1 sbi codes
2223 companies have 2 sbi codes
567 companies have 3 sbi codes
314 companies have 4 sbi codes
194 companies have 5 sbi codes
58 companies have 6 sbi codes
19 companies have 7 sbi codes
16 companies have 8 sbi codes
8 companies have 17 sbi codes
8 companies have 30 sbi codes
In [8]:
sbiLen = df.groupby('sbiLen').sum().sort_values('kg', ascending=False).reset_index()

fig, ax = plt.subplots(1,1,figsize=(9*2,5))

ax.pie(sbiLen.kg)
ax.legend(sbiLen.sbiLen, loc='center left', bbox_to_anchor=(1,0.5))
ax.set_title('% waste (in kg) associated with x number of sbis')

plt.show()
In [9]:
df[df.sbiLen == 30]
Out[9]:
eaNaam gnc ewc vmc sbi kg sbiLen
24143 gemeente amsterdam 31 -- B 52,91,32,55,71,81,45,74,38,88,93,02,64,62,77,9... 1720 30
27407 stadsdeel oost 25 -- B 52,91,32,55,71,81,45,74,38,88,93,02,64,62,77,9... 3840 30
28067 gemeente amsterdam -- 17 A 52,91,32,55,71,81,45,74,38,88,93,02,64,62,77,9... 1911140 30
28068 gemeente amsterdam -- 17 B 52,91,32,55,71,81,45,74,38,88,93,02,64,62,77,9... 561000 30
28093 stadsdeel oost -- 17 B 52,91,32,55,71,81,45,74,38,88,93,02,64,62,77,9... 187400 30
29315 dienst ivv materiaaldienstgladheidsbestrijd 25 -- B 52,91,32,55,71,81,45,74,38,88,93,02,64,62,77,9... 540000 30
29316 dienst ivv materiaaldienstgladheidsbestrijd 25 -- B 52,91,32,55,71,81,45,74,38,88,93,02,64,62,77,9... 2168520 30
38950 gemeente amsterdam -- 02 B 52,91,32,55,71,81,45,74,38,88,93,02,64,62,77,9... 583290 30
In [10]:
# 30 associated sbi codes of gemeente amsterdam  
# maker pd.series of the 30 sbi codes 
ams = df[df.sbiLen == 30].sbi.iloc[0]
ams = ams.split(',')
ams = pd.DataFrame(ams, columns=['sbi'])

# merge with sbi chapters 
sbiChap = feather.read_dataframe('classification/sbi_Headings.feather')
ams = pd.merge(ams, sbiChap[['sbi', 'sbiDesc']], how='left', on='sbi')

# display 
ams.head(10)
Out[10]:
sbi sbiDesc
0 52 Opslag en dienstverlening voor vervoer
1 91 Culturele uitleencentra, openbare archieven, m...
2 32 Vervaardiging van overige goederen
3 55 Logiesverstrekking
4 71 Architecten, ingenieurs en technisch ontwerp e...
5 81 Facility management, reiniging en landschapsve...
6 45 Handel in en reparatie van auto’s, motorfietse...
7 74 Industrieel ontwerp en vormgeving, fotografie,...
8 38 Afvalinzameling en -behandeling; voorbereiding...
9 88 Maatschappelijke dienstverlening zonder overna...

Group by GNC, EWC, VMC, SBI chapters; but include SBI codes for companies with 1 sbi

For companies with multiple SBI codes, their sbi code becomes '--'

In [11]:
dfOneSbi = df
dfOneSbi.loc[df.sbiLen > 1, 'sbi'] = '--'
dfOneSbi = dfOneSbi.groupby(['gnc', 'ewc', 'vmc', 'sbi']).sum().kg.reset_index()
dfOneSbi['type'] = 'gnc:' + dfOneSbi.gnc + ' ewc:' + dfOneSbi.ewc + ' vmc:' + dfOneSbi.vmc + ' sbi:' + dfOneSbi.sbi
dfOneSbi = dfOneSbi.sort_values('kg', ascending=False)
In [12]:
num = len(dfOneSbi.sbi.unique())
print('number of unique sbi chapters: {}'.format(num))
number of unique sbi chapters: 72
In [13]:
fig, ax = plt.subplots(1,2,figsize=(9*2,5))

ax[0].pie(dfOneSbi.kg.head(20))
ax[0].legend(dfOneSbi.type.head(20), loc='center left', bbox_to_anchor=(1,0.5))
ax[0].set_title('top 20 flow types from afgifte dataset')

ax[1].pie(dfOneSbi.kg)
ax[1].set_title('all flow types ({} in total)'.format(len(dfOneSbi.kg)))

plt.show()

Group by GNC, EWC, VMC chapters and SBI sections; but include SBI codes for companies with 1 sbi

Now let's try to use sbi section headers to categorize the flows.

In [14]:
# read sbiChap, which includes chapter headings and which section each chapter belongs to. 
sbiChap = feather.read_dataframe('classification/sbi_Chapters.feather')
sbiChap = sbiChap[['section', 'sbi', 'sbiDesc']]

# make df with sbi section heads
dfSbiSec = dfOneSbi.copy()
dfSbiSec.reset_index(inplace=True, drop=True)
dfSbiSec['sbiSec'] = pd.merge(dfSbiSec.sbi, sbiChap, how='left', on='sbi').section
dfSbiSec.sbiSec.fillna('--', inplace=True)

# groupby
dfSbiSec = dfSbiSec.groupby(['gnc', 'ewc', 'vmc', 'sbiSec']).sum().reset_index().sort_values('kg', ascending=False)
dfSbiSec['type'] = 'gnc:' + dfSbiSec.gnc + ' ewc:' + dfSbiSec.ewc + ' vmc:' + dfSbiSec.vmc + ' sbi:' + dfSbiSec.sbiSec
In [15]:
print('number of unique sbi sections: {}'.format(len(dfSbiSec.sbiSec.unique())))
number of unique sbi sections: 20
In [16]:
fig, ax = plt.subplots(1,2,figsize=(9*2,5))

ax[0].pie(dfSbiSec.kg.head(20))
ax[0].legend(dfSbiSec.type.head(20), loc='center left', bbox_to_anchor=(1,0.5))
ax[0].set_title('top 20 flow types from afgifte dataset')

ax[1].pie(dfSbiSec.kg)
ax[1].set_title('all flow types ({} in total)'.format(len(dfSbiSec.kg)))

plt.show()

Group by GNC sections | EWC and VMC chapters | SBI sections; but include SBI codes for companies with 1 sbi

In [17]:
# make copy for new df
dfGncSec = dfSbiSec.copy()
dfGncSec.drop(labels=['type'], axis=1, inplace=True)
dfGncSec.reset_index(inplace=True, drop=True)

# add gnc section headers 
gncSec = feather.read_dataframe('classification/gnc_Headings.feather')
dfGncSec['gncSec'] = pd.merge(dfGncSec.gnc, gncSec[['gnc', 'section']], how='left', on='gnc').section
dfGncSec = dfGncSec[['gnc','gncSec', 'ewc', 'vmc', 'sbiSec', 'kg']]
dfGncSec.gncSec.replace(np.NaN, '--', inplace=True)

# groupby 
dfGncSec = dfGncSec.groupby(['gncSec', 'ewc', 'vmc', 'sbiSec']).sum().reset_index().sort_values('kg', ascending=False)
dfGncSec['type'] = 'gnc:' + dfGncSec.gncSec + ' ewc:' + dfGncSec.ewc + ' vmc:' + dfGncSec.vmc + ' sbi:' + dfGncSec.sbiSec
In [18]:
fig, ax = plt.subplots(1,2,figsize=(9*2,5))

ax[0].pie(dfGncSec.kg.head(20))
ax[0].legend(dfGncSec.type.head(20), loc='center left', bbox_to_anchor=(1,0.5))
ax[0].set_title('top 20 flow types from afgifte dataset')

ax[1].pie(dfGncSec.kg)
ax[1].set_title('all flow types ({} in total)'.format(len(dfGncSec.kg)))

plt.show()

Group by GNC, EWC, VMC chapters (and without SBI)

Without the sbis, there are less types - only 118 in total, see below.

In [19]:
ftNoSbi = ft.groupby(['gnc', 'ewc', 'vmc']).sum().kg.reset_index().sort_values('kg', ascending=False)
ftNoSbi['type'] = 'gnc:' + ftNoSbi.gnc + ' ewc:' + ftNoSbi.ewc + ' vmc:' + ftNoSbi.vmc

fig, ax = plt.subplots(1,2,figsize=(9*2,5))

ax[0].pie(ftNoSbi.kg.head(20))
ax[0].legend(ftNoSbi.type.head(20), loc='center left', bbox_to_anchor=(1,0.5))
ax[0].set_title('top 20 flow types from afgifte dataset')

ax[1].pie(ftNoSbi.kg)
ax[1].set_title('all flow types ({} in total)'.format(len(ftNoSbi.kg)))

plt.show()

Distribution of gnc, ewc, vmc, and sbi codes in afgifte dataset by weight

In [20]:
# groupby to create pie chart (matplotlib)
# gnc 
gnc = ft.groupby('gnc').sum().reset_index()
gnc = gnc.sort_values('kg', ascending=False)
gncLen = len(gnc)
gnc.loc[gnc['kg'] < 150000000, 'gnc'] = 'other'
gnc = gnc.groupby('gnc').sum().reset_index().sort_values('kg', ascending=False)
gnc = gnc[gnc.gnc != '--']

# ewc
ewc = ft.groupby('ewc').sum().reset_index()
ewc = ewc.sort_values('kg', ascending=False)
ewcLen = len(ewc)
ewc.loc[ewc['kg'] < 800000000, 'ewc'] = 'other'
ewc = ewc.groupby('ewc').sum().reset_index().sort_values('kg', ascending=False)
ewc = ewc[ewc.ewc != '--']

# vmc
vmc = ft.groupby('vmc').sum().reset_index()
vmc = vmc.sort_values('kg', ascending=False)
vmcLen = len(vmc)

# sbi
sbi = ft.groupby('sbi').sum().kg.reset_index().sort_values('kg', ascending=False)
sbiLen = len(sbi)
sbi.loc[sbi['kg'] < 161052464, 'sbi'] = 'other'
sbi = sbi.groupby('sbi').sum().reset_index().sort_values('kg', ascending=False)

# merge gnc and ewc columns to create df of material (mat for short)
gnc['mat'] = 'gnc' + gnc.gnc
ewc['mat'] = 'ewc' + ewc.ewc
mat = pd.concat([gnc, ewc])
mat.drop(labels=['gnc', 'ewc', 'sbiLen'], inplace=True, axis=1)
mat = mat[['mat', 'kg']]
mat.sort_values('kg', ascending=False, inplace=True)
In [21]:
fig, ax = plt.subplots(1,3,figsize=(8*3,6))

ax[0].pie(mat.kg)
ax[0].legend(mat.mat, loc='center left', bbox_to_anchor=(1,0.5))
ax[0].set_title('gnc/ewc')

ax[1].pie(vmc.kg)
ax[1].legend(vmc.vmc, loc='center left', bbox_to_anchor=(1,0.5))
ax[1].set_title('vmc')

ax[2].pie(sbi.kg)
ax[2].legend(sbi.sbi, loc='center left', bbox_to_anchor=(1,0.5))
ax[2].set_title('sbi')

# display 
print('distribution of gnc, ewc, and vmc codes in afgifte dataset by weight')
plt.show()

print('# unique gnc codes: {}'.format(gncLen))
print('# unique ewc codes: {}'.format(ewcLen))
print('# unique vmc codes: {}'.format(vmcLen))
print('# unique sbi codes: {}'.format(sbiLen))
distribution of gnc, ewc, and vmc codes in afgifte dataset by weight
# unique gnc codes: 32
# unique ewc codes: 19
# unique vmc codes: 7
# unique sbi codes: 370

Picking a categorization method and interpreting it

So far, we've tried to categorized the flows in the following ways:

  • Grouping by GNC, EWC, VMC, and SBI chapters (1756 types)
  • Group by GNC, EWC, VMC, SBI chapters; but only including SBI codes for companies with 1 sbi (961 types)
  • Group by GNC, EWC, VMC chapters and SBI sections; but include SBI codes for companies with 1 sbi (586 types)
  • Group by GNC sections | EWC and VMC chapters | SBI sections; but include SBI codes for companies with 1 sbi (520 types)
  • Group by GNC, EWC, VMC chapters (and without SBI) (118 types)

I will pick the third categorization method and add descriptions to it. (The fourth method wasn't chosen because using GNC sections turned out to be too vague.) With this, we can work with LMA experts to identify the flow types with locations that represent the real location of reuse. If it turns out that SBI codes don't really make a difference, then I will move on to the last categorization method, which doesn't include SBI codes.

In [22]:
# ADD DESCRIPTIONS 
# read classification files 
gncDesc = feather.read_dataframe('classification/gnc_Headings.feather')
ewcDesc = feather.read_dataframe('classification/ewc.feather')
vmcDesc = feather.read_dataframe('classification/vmc.feather')
sbiDesc = feather.read_dataframe('classification/sbi_Headings.feather')

# make copy of dfGncSec (dfl stands for df for lma)
dfl = dfSbiSec.copy()

# add descriptions 
dfl = pd.merge(dfl, gncDesc[['gnc', 'gncDesc']], how='left', on='gnc') # gnc
dfl = pd.merge(dfl, ewcDesc, how='left', on='ewc') # ewc
dfl = pd.merge(dfl, vmcDesc, how='left', on='vmc') # vmc
dfl = pd.merge(dfl, sbiDesc[['sbi', 'sbiDesc']], how='left', left_on='sbiSec', right_on='sbi') # sbi

# rearrange columns 
dfl = dfl[['gnc', 'gncDesc', 'ewc', 'ewcDesc', 'vmc', 'vmcDesc', 'sbiSec', 'sbiDesc', 'kg']]

# remove np.NaN
dfl.replace(np.NaN, '--', inplace=True)

# display all flow types with descriptions 
dfl.head(10)
Out[22]:
gnc gncDesc ewc ewcDesc vmc vmcDesc sbiSec sbiDesc kg
0 25 SALT; SULPHUR; EARTHS AND STONE; PLASTERING MA... -- -- B You use the waste immediately. This does not c... -- -- 7747384037
1 -- -- 19 WASTES FROM WASTE MANAGEMENT FACILITIES, OFF-... B You use the waste immediately. This does not c... -- -- 4385371403
2 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... -- -- 3717413086
3 25 SALT; SULPHUR; EARTHS AND STONE; PLASTERING MA... -- -- B You use the waste immediately. This does not c... F Bouwnijverheid 3063316815
4 -- -- 19 WASTES FROM WASTE MANAGEMENT FACILITIES, OFF-... B You use the waste immediately. This does not c... F Bouwnijverheid 2004940810
5 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... F Bouwnijverheid 1319977724
6 25 SALT; SULPHUR; EARTHS AND STONE; PLASTERING MA... -- -- B You use the waste immediately. This does not c... G Groot- en detailhandel; reparatie van auto’s 1205065010
7 25 SALT; SULPHUR; EARTHS AND STONE; PLASTERING MA... -- -- B You use the waste immediately. This does not c... C Industrie 1176602615
8 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... E Winning en distributie van water; afval- en af... 1053072291
9 25 SALT; SULPHUR; EARTHS AND STONE; PLASTERING MA... -- -- B You use the waste immediately. This does not c... M Advisering, onderzoek en overige specialistisc... 925313008
In [23]:
# make different flow types 
dfNoSbi = dfl.groupby(['gnc', 'gncDesc', 'ewc', 'ewcDesc', 'vmc', 'vmcDesc']).sum().sort_values('kg', ascending=False).reset_index() # flow types without sbi
dfMat = dfl.groupby(['gnc', 'gncDesc', 'ewc', 'ewcDesc']).sum().sort_values('kg', ascending=False).reset_index() # material flow types (ewc, gnc)
dfPro = dfl.groupby(['vmc', 'vmcDesc']).sum().sort_values('kg', ascending=False).reset_index() # processing flow types (vmc)
In [24]:
import xlsxwriter
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('lma/flowTypes.xlsx', engine='xlsxwriter')

# save as excel file with multiple sheets 
dfl.to_excel(writer, sheet_name='flowTypes_allCodes')
dfNoSbi.to_excel(writer, sheet_name='flowTypes_noSbi')
dfMat.to_excel(writer, sheet_name='flowTypes_mat')
dfPro.to_excel(writer, sheet_name='flowTypes_pro')

# Close the Pandas Excel writer and output the Excel file.
writer.save()

Understanding flows (material + processing codes) and their associated SBIs

There seem to be some combinations of material & processing codes (e.g. reused construction and demolition waste) that are associated with all SBI codes. This means that all industries, from financial services to construction companies, are involved in reusing C&D waste (see example below).

Not all flow types are associated with multiple SBIs, and this section will further explain this.

In [25]:
row = dfl.iloc[2]
mask = (dfl.gnc == row.gnc) & (dfl.ewc == row.ewc) & (dfl.vmc == row.vmc)
dfl[mask]
Out[25]:
gnc gncDesc ewc ewcDesc vmc vmcDesc sbiSec sbiDesc kg
2 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... -- -- 3717413086
5 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... F Bouwnijverheid 1319977724
8 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... E Winning en distributie van water; afval- en af... 1053072291
17 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... G Groot- en detailhandel; reparatie van auto’s 484783769
23 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... A Landbouw, bosbouw en visserij 328423787
24 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... H Vervoer en opslag 315349738
32 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... K Financiële instellingen 226920545
45 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... N Verhuur van roerende goederen en overige zakel... 142906100
51 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... B Winning van delfstoffen 105442420
61 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... O Openbaar bestuur, overheidsdiensten en verplic... 74524955
66 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... C Industrie 64473525
68 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... M Advisering, onderzoek en overige specialistisc... 61500990
99 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... L Verhuur van en handel in onroerend goed 25971419
121 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... D Productie en distributie van en handel in elek... 14388910
151 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... I Logies-, maaltijd- en drankverstrekking 8347050
169 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... S Overige dienstverlening 6563860
200 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... Q Gezondheids- en welzijnszorg 3618660
247 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... J Informatie en communicatie 1978980
371 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... R Cultuur, sport en recreatie 386390
419 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... P Onderwijs 194860
In [26]:
# finding number of sbis associated for each flow type (flow type = material + processing type)
# count number of sbis associated for each flow type 
def numSbi(row): 
    mask = (dfl.gnc == row.gnc) & (dfl.ewc == row.ewc) & (dfl.vmc == row.vmc)
    return len(dfl[mask])
dfNoSbi['numSbi'] = dfNoSbi.apply(lambda row: numSbi(row), axis=1)

# number of flow types with x number of associated sbis
numSbi = dfNoSbi.groupby('numSbi').count().gnc.reset_index().sort_values('numSbi')
numSbi.rename(columns={'gnc': 'count'}, inplace=True)

# kg of waste with x number of associated sbis 
kgSbi = dfNoSbi.groupby('numSbi').sum().sort_values('kg', ascending=False).reset_index()
In [27]:
# pie chart explaining number of SBIs associated with major flows 
fig, ax = plt.subplots(1,2,figsize=(9*2,6))

# rows with x number of associated sbis (by count)
ax[0].pie(numSbi['count'])
ax[0].legend(numSbi.numSbi, loc='center left', bbox_to_anchor=(1,0.5), title='# associated sbis')
ax[0].set_title('% of flow types with x number of associated sbis')

# kg of waste with x number of associated sbis (by weight)
ax[1].pie(kgSbi.kg)
ax[1].legend(kgSbi.numSbi, loc='center left', bbox_to_anchor=(1,0.5), title='# associated sbis')
ax[1].set_title('kg of waste types with x number of associated sbis')

plt.show()
In [28]:
def asSbi(gnc, ewc, vmc): 
    mask = (dfl.gnc == gnc) & (dfl.ewc == ewc) & (dfl.vmc == vmc)
    return dfl[mask]

asSbi('--', '17', 'B') # flow type with 20 SBIs: reused C&D waste 
Out[28]:
gnc gncDesc ewc ewcDesc vmc vmcDesc sbiSec sbiDesc kg
2 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... -- -- 3717413086
5 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... F Bouwnijverheid 1319977724
8 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... E Winning en distributie van water; afval- en af... 1053072291
17 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... G Groot- en detailhandel; reparatie van auto’s 484783769
23 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... A Landbouw, bosbouw en visserij 328423787
24 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... H Vervoer en opslag 315349738
32 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... K Financiële instellingen 226920545
45 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... N Verhuur van roerende goederen en overige zakel... 142906100
51 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... B Winning van delfstoffen 105442420
61 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... O Openbaar bestuur, overheidsdiensten en verplic... 74524955
66 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... C Industrie 64473525
68 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... M Advisering, onderzoek en overige specialistisc... 61500990
99 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... L Verhuur van en handel in onroerend goed 25971419
121 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... D Productie en distributie van en handel in elek... 14388910
151 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... I Logies-, maaltijd- en drankverstrekking 8347050
169 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... S Overige dienstverlening 6563860
200 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... Q Gezondheids- en welzijnszorg 3618660
247 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... J Informatie en communicatie 1978980
371 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... R Cultuur, sport en recreatie 386390
419 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... B You use the waste immediately. This does not c... P Onderwijs 194860
In [29]:
asSbi('23', '--', 'B') # flow type with 1 sbi: reused food waste 
Out[29]:
gnc gncDesc ewc ewcDesc vmc vmcDesc sbiSec sbiDesc kg
90 23 RESIDUES AND WASTE FROM THE FOOD INDUSTRIES; P... -- -- B You use the waste immediately. This does not c... -- -- 35127520
In [30]:
asSbi('--', '17', 'E') # flow type with 2 sbis: composted c&d waste 
Out[30]:
gnc gncDesc ewc ewcDesc vmc vmcDesc sbiSec sbiDesc kg
128 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... E You treat the waste microbiologically. -- -- 13385120
542 -- -- 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING... E You treat the waste microbiologically. N Verhuur van roerende goederen en overige zakel... 20820
In [31]:
asSbi('--', '15', 'D') # flow type with 3 sbis: mechanically treated waste packaging 
Out[31]:
gnc gncDesc ewc ewcDesc vmc vmcDesc sbiSec sbiDesc kg
337 -- -- 15 WASTE PACKAGING; ABSORBENTS, WIPING CLOTHS, F... D You treat the waste mechanically or physically. G Groot- en detailhandel; reparatie van auto’s 652165
388 -- -- 15 WASTE PACKAGING; ABSORBENTS, WIPING CLOTHS, F... D You treat the waste mechanically or physically. K Financiële instellingen 289692
424 -- -- 15 WASTE PACKAGING; ABSORBENTS, WIPING CLOTHS, F... D You treat the waste mechanically or physically. H Vervoer en opslag 172904

Categorizing by industry type

Questions:

  • What's a good way to categorize sbi codes into industry types?
  • what % of flows (by count, by weight) belongs to making industry, construction industry, other, or un-matched? (or some other categories)
  • for each industry type:
    • mapping
      • how are the flows of this industry spread across the country? how geographically clustered (Moran's I) are the eerste afnemers?
    • stacked bar chart (?)
      • what are the materials associated with this industry?
      • what are the processing types associated with this industry?

What's a good way to categorize sbi codes into industry types?

SBI codes are actually the same as NACE codes - the first 4 digits of SBI are the same as the first 4 digits of NACE. The following section headings can be considered as 'making':

  • A - agriculture, forestry, and fishing
  • C - manufacturing (this could potentially be sub-categorized)
  • F - construction

Note: I would like to do all three industry types, because this allows us to identify similarities and differences of the different industries. However, as you will see in the following sections, the three industries differ quite a bit in terms of amount of kg received, as well as spatial distribution of the waste received. You could argue that the flows are so different that they're not even comparable.

In [32]:
sbi = feather.read_dataframe('classification/sbi_Headings.feather')
secMake = ['A', 'C', 'F']
sbiMake = sbi[sbi.section.isin(secMake)]
print('number of chapters included as making industry: {}'.format(len(sbiMake) - len(secMake)))
print(sbiMake.sbi.unique())
sbiMake.head()
number of chapters included as making industry: 30
['A' '01' '02' '03' 'C' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19'
 '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '30' '31' '32' '33' 'F'
 '41' '42' '43']
Out[32]:
sbi sbiDesc section
0 A Landbouw, bosbouw en visserij A
1 01 Landbouw, jacht en dienstverlening voor de lan... A
2 02 Bosbouw, exploitatie van bossen en dienstverle... A
3 03 Visserij en kweken van vis en schaaldieren A
8 C Industrie C

What % of flows (by count, by weight) belongs to each industry type?

In [33]:
# make dataframe of kg associated with sbi codes
dfs = ft.groupby('sbi').sum().kg.reset_index().sort_values('kg', ascending=False)

# mark out multiple sbi codes as 'mul'
def sbiMul(x): 
    if len(x) > 2: 
        return 'mul'
    else: 
        return x 
dfs.sbi = dfs.sbi.map(lambda x: sbiMul(x))
dfs = dfs.groupby('sbi').sum().reset_index().sort_values('kg', ascending=False)

# merge with section headings 
dfs = pd.merge(dfs, sbi, how='left', on='sbi')
dfs.section = dfs.section.replace(np.NaN, '--')
def na(row): 
    if row.section == '--': 
        row.section = row.sbi 
    return row 
dfs = dfs.apply(lambda row: na(row), axis=1)
dfs = dfs.groupby(['section']).sum().reset_index().sort_values('kg', ascending=False)

# matched? 
def match(x): 
    if x == '--': 
        return 'sbi unknown'
    elif x == 'mul': 
        return 'multiple sbis'
    else: 
        return 'matched'
dfs['match'] = dfs.section.map(lambda x: match(x))

# making vs non-making 
def making(x): 
    if x in secMake: 
        return x
    elif x == '--': 
        return 'sbi unknown'
    elif x == 'mul': 
        return 'multiple sbis'
    else: 
        return 'non-maker'
dfs['maker'] = dfs.section.map(lambda x: making(x))
    
# display 
dfs.head()
Out[33]:
section kg match maker
0 -- 14776643959 sbi unknown sbi unknown
6 F 7062574041 matched F
20 mul 5727018459 multiple sbis multiple sbis
7 G 3075864017 matched non-maker
5 E 2874364254 matched non-maker
In [34]:
# pie chart explaining number of SBIs associated with major flows 
fig, ax = plt.subplots(1,3,figsize=(8*3,6))

# explode -- and mul values 
explode = np.zeros(len(dfs))
explode[[0,2]] = 0.2

# rows with x number of associated sbis (by count)
ax[0].pie(dfs.kg)
ax[0].legend(dfs.section, loc='center left', bbox_to_anchor=(1,0.5), title='sbi section')
ax[0].set_title('sbi code sections (by weight)')

# matched vs unmatched sbis 
dfss = dfs.groupby('match').sum().reset_index().sort_values('kg', ascending=False)
ax[1].pie(dfss.kg)
ax[1].legend(dfss.match, loc='center left', bbox_to_anchor=(1,0.5))
ax[1].set_title('matched vs unmatched sbi codes (by weight)')

# maker vs non-maker sbis 
dfsm = dfs.groupby('maker').sum().reset_index().sort_values('kg', ascending=False)
colors = ['#e8e8e8', '#adadad', '#41ab66', '#e8e8e8', '#639beb', '#e84f4f']
ax[2].pie(dfsm.kg, colors=colors)
ax[2].legend(dfsm.maker, loc='center left', bbox_to_anchor=(1,0.5))
ax[2].set_title('makers vs non-makers (by weight)')

plt.show()

As seen in the second pie chart, almost half of the flows from the dataset aren't matched with an SBI code, either because the eerste afnemer is associated with multiple sbi codes, or we were unable to match an sbi code at all. If we are depending on industry type to categorize the flows, it is essential to improve the sbi matching process. Rusne's code would be a good reference for improvement.

Again, the three relevant SBI sections for us are:

  • A - agriculture, forestry, and fishing
  • C - manufacturing (this could potentially be sub-categorized)
  • F - construction

Associated materials and processing types for each industry

For the three making industries categorized above, what are their associated materials and processing types?

In [35]:
dfMake = dfl[dfl.sbiSec.isin(['A', 'C', 'F'])]
dfMake = dfMake.sort_values('kg', ascending=False)

# material column
def mat(row): 
    if row.gnc != '--': 
        mat = 'GNC- ' + row.gncDesc
    else: 
        mat = 'EWC- ' + row.ewcDesc
    return mat 
dfMake['mat'] = dfMake.apply(lambda row: mat(row), axis=1)
dfMake.mat = dfMake.mat.str[:80]
In [36]:
dfMat = dfMake.groupby(['sbiSec', 'sbiDesc', 'mat']).sum().reset_index().sort_values('kg', ascending=False)
fig = px.bar(dfMat, x="sbiDesc", y="kg", color="mat", title="Materials received by eerste afnemers in each industry")
fig.show()
  • The material codes might not be specific enough - for example, we don't really know what is meant by 'wastes from waste management facilities'. It would be good to see which flows might beed to be sub-categorized. (Although that would be a bit strange - subcategorizing some flows and not others.)
  • Since all the material flows recorded in the afgifte dataset comes from waste management companies, what's the difference between 'wastes from waste management facilities' and all the other materials? Is it again just an issue of personal definition like with the ewc/gnc codes? It would be good to ask Tjerk this question.
In [37]:
dfp = dfMake.groupby(['sbiSec', 'sbiDesc', 'vmc', 'vmcDesc']).sum().reset_index()
dfp.vmcDesc = dfp.vmcDesc.str[:80]
def vmc(row): 
    vmcDesc = row.vmc + ' - ' + row.vmcDesc
    return vmcDesc
dfp.vmcDesc = dfp.apply(lambda row: vmc(row), axis=1)
dfp = dfp.sort_values('kg', ascending=False)
dfp.head()
Out[37]:
sbiSec sbiDesc vmc vmcDesc kg
13 F Bouwnijverheid B B - You use the waste immediately. This does n... 6504405153
1 A Landbouw, bosbouw en visserij B B - You use the waste immediately. This does n... 2414443356
7 C Industrie B B - You use the waste immediately. This does n... 2209151747
14 F Bouwnijverheid C C - You treat the waste mechanically or physic... 287297589
12 F Bouwnijverheid A A - You save waste. Or you temporarily store w... 235196359
In [38]:
fig = px.bar(dfp, x="sbiDesc", y="kg", color="vmcDesc", title="Processes associated wtih eerste afnemers in each industry")
fig.show()

I don't really understand why 'G - you dump the waste' can be an option in the afgifte dataset - so are landfills and incinerators also included as eerste afnemers in the afgifte dataset?

Categorizing by material type

Questions:

  • What's a good way to categorize GN / EW codes into material types?
  • what % of flows (by weight) belongs to each material type?
  • for each material type:
    • stacked bar chart
      • what are the industries associated with this material?
      • what are the processing types associated with this material?
In [39]:
ft.head()
Out[39]:
gnc ewc vmc sbi kg sbiLen type
1317 25 -- B -- 5179703589 5002 gnc:25 ewc:-- vmc:B sbi:--
710 -- 19 B -- 3085370931 5024 gnc:-- ewc:19 vmc:B sbi:--
405 -- 17 B -- 2985070432 3849 gnc:-- ewc:17 vmc:B sbi:--
1386 25 -- B 43 1432089123 1173 gnc:25 ewc:-- vmc:B sbi:43
1380 25 -- B 42 1295399925 965 gnc:25 ewc:-- vmc:B sbi:42

What's a good way to categorize GN / EW codes into material types?

In [40]:
gncDesc = feather.read_dataframe('classification/gnc_Headings.feather')
ewcDesc = feather.read_dataframe('classification/ewc.feather')

# read gnc classification codes
print(gncDesc[gncDesc.level == 1].shape)
gncDesc[gncDesc.level == 1]
(10, 4)
Out[40]:
level gnc section gncDesc
0 1 I I LIVE ANIMALS; ANIMAL PRODUCTS
6 1 II II VEGETABLE PRODUCTS
17 1 IV IV PREPARED FOODSTUFFS; BEVERAGES, SPIRITS AND VI...
27 1 V V MINERAL PRODUCTS
31 1 VI VI PRODUCTS OF THE CHEMICAL OR ALLIED INDUSTRIES
48 1 IX IX WOOD AND ARTICLES OF WOOD; WOOD CHARCOAL; CORK...
52 1 X X PULP OF WOOD OR OF OTHER FIBROUS CELLULOSIC MA...
56 1 XI XI TEXTILES AND TEXTILE ARTICLES
79 1 XV XV BASE METALS AND ARTICLES OF BASE METAL
101 1 XX XX MISCELLANEOUS MANUFACTURED ARTICLES
In [41]:
# read ewc classification codes 
print(ewcDesc[ewcDesc.ewc.str.len() == 2].shape)
ewcDesc[ewcDesc.ewc.str.len() == 2]
(20, 2)
Out[41]:
ewc ewcDesc
0 01 WASTES RESULTING FROM EXPLORATION, MINING, QUA...
1 02 WASTES FROM AGRICULTURE, HORTICULTURE, AQUACU...
2 03 WASTES FROM WOOD PROCESSING AND THE PRODUCTIO...
3 04 WASTES FROM THE LEATHER, FUR AND TEXTILE INDU...
4 05 WASTES FROM PETROLEUM REFINING, NATURAL GAS PU...
5 06 WASTES FROM INORGANIC CHEMICAL PROCESSES
6 07 WASTES FROM ORGANIC CHEMICAL PROCESSES
7 08 WASTES FROM THE MANUFACTURE, FORMULATION, SUP...
8 09 WASTES FROM THE PHOTOGRAPHIC INDUSTRY
9 10 WASTES FROM THERMAL PROCESSES
10 11 WASTES FROM CHEMICAL SURFACE TREATMENT AND CO...
11 12 WASTES FROM SHAPING AND PHYSICAL AND MECHANIC...
12 13 OIL WASTES AND WASTES OF LIQUID FUELS (except...
13 14 WASTE ORGANIC SOLVENTS, REFRIGERANTS AND PROP...
14 15 WASTE PACKAGING; ABSORBENTS, WIPING CLOTHS, F...
15 16 WASTES NOT OTHERWISE SPECIFIED IN THE LIST
16 17 CONSTRUCTION AND DEMOLITION WASTES (INCLUDING...
17 18 WASTES FROM HUMAN OR ANIMAL HEALTH CARE AND/O...
18 19 WASTES FROM WASTE MANAGEMENT FACILITIES, OFF-...
19 20 MUNICIPAL WASTES (HOUSEHOLD WASTE AND SIMILAR...
In [42]:
# create df of kg produced per material
mat = ft.groupby(['gnc', 'ewc']).sum().kg.reset_index()
mat.head()
Out[42]:
gnc ewc kg
0 -- 01 10561430
1 -- 02 977963383
2 -- 03 9630447
3 -- 04 309460
4 -- 05 32050

Looking for mistakes in the data

In [43]:
# what codes are in the data? 
print('ewc codes: {}\n'.format(mat.ewc.unique()))
print('gnc codes: {}'.format(mat.gnc.unique()))
ewc codes: ['01' '02' '03' '04' '05' '06' '07' '08' '10' '11' '12' '13' '15' '16'
 '17' '18' '19' '20' '--']

gnc codes: ['--' '10' '14' '15' '17' '20' '22' '23' '25' '26' '27' '28' '29' '31'
 '32' '38' '39' '40' '44' '47' '48' '68' '72' '73' '74' '76' '78' '80'
 '84' '85' '99' 'er']

As seen above, there is something wrong with the codes.

  • GNC chapter codes go from 01 to 98, but there are codes with with 'er' and 99.
  • Luckily, as seen below, the 'bad' gnc codes only take up 0.12% of the total flows (by weight). Perhaps we could just drop these flows altogether.
In [48]:
gncBad = mat[mat.gnc.isin(['00', '99'])][['gnc', 'kg']]
matSum = mat.kg.sum()
gncBad['percKg'] = gncBad.kg.map(lambda x: round(x/matSum*100, 2))
print('% of bad gnc codes by weight: {}'.format(gncBad.percKg.sum()))
gncBad
% of bad gnc codes by weight: 0.02
Out[48]:
gnc kg percKg
47 99 6826018 0.02

What % of flows (by weight) belongs to each material type?

In [ ]:
 

What are the industries associated with this material?

In [ ]:
 

What are the processing types associated with this material?

In [ ]: